Artificial Intelligence in the Life Sciences — Latest Matching Preprints

1

Is brightfield all you need for mechanism of action prediction?

Gupta, A.; Harrison, P. J.; Wieslander, H.; Rietdijk, J.; Carreras-Puigvert, J.; Georgiev, P.; Wahlby, C.; Spjuth, O.; Sintorn, I.-M.

2022-10-13 bioinformatics 10.1101/2022.10.12.511869 medRxiv

Top 0.1%

13.2%

Fluorescence staining techniques, such as Cell Painting, together with fluorescence microscopy have proven invaluable for visualizing and quantifying the effects that drugs and other perturbations have on cultured cells. However, fluorescence microscopy is expensive, time-consuming, and labor-intensive, and the stains applied can be cytotoxic, interfering with the activity under study. The simplest form of microscopy, brightfield microscopy, lacks these downsides, but the images produced have low contrast and the cellular compartments are difficult to discern. Nevertheless, by harnessing deep learning, these brightfield images may still be sufficient for various predictive purposes. In this study, we compared the predictive performance of models trained on fluorescence images to those trained on brightfield images for predicting the mechanism of action (MoA) of different drugs. We also extracted CellProfiler features from the fluorescence images and used them to benchmark the performance. Overall, we found comparable and correlated predictive performance for the two imaging modalities. This is promising for future studies of MoAs in time-lapse experiments.

2

Mutation Pathogenicity Prediction by a Biology Based Explainable AI Multi-Modal Algorithm

Kellerman, R.; Nayshool, O.; Barel, O.; Paz, S.; Amariglio, N.; Klang, E.; Rechavi, G.

2024-06-05 genetic and genomic medicine 10.1101/2024.06.05.24308476 medRxiv

Top 0.1%

12.6%

Most known pathogenic mutations occur in protein-coding regions of DNA and change the way proteins are made. Deciphering the protein structure therefore provides great insight into the molecular mechanisms underlying biological functions in human disease. While there have recently been major advances in the artificial intelligence-based prediction of protein structure, the determination of the biological and clinical relevance of specific mutations is not yet up to clinical standards. This challenge is of utmost medical importance when decisions as critical as suggesting termination of pregnancy or recommending cancer-directed rational drugs depend on the accuracy of prediction of the effect of the specific mutation. Currently available tools aim to characterize the effect of a mutation on the functionality of the protein according to biochemical criteria, independent of the biological context. A specific change in protein structure can result in either loss of function (LOF) or gain of function (GOF), and the ability to identify the directionality of effect needs to be taken into consideration when interpreting the biological outcome of the mutation. Here we describe Triple-modalities Variant Interpretation and Analysis (TriVIAI), a tool incorporating three complementary modalities for improved prediction of missense mutation pathogenicity: a protein language model (pLM), a graph neural network (GNN), and a tabular model incorporating physical properties from the protein structure. The TriVIAI ensemble's predictions compare favorably with the existing tools across various metrics, achieving an AUC-ROC of 0.887, a precision-recall curve (PRC) score of 0.68, and a Brier score of 0.16. The TriVIAI ensemble also has two major advantages compared to other available tools. The first is the incorporation of biological insights that allow differentiation between GOF mutations, which tend to cluster in specific hotspots and affect structure in a specific functional way, and LOF mutations, which are usually dispersed and can cripple the protein in a variety of different ways. Importantly, the advantage over other available tools is more noticeable with GOF mutations, as their effect on the protein structure is less disruptive and can be misinterpreted by current variant prioritization strategies. The second significant advantage of TriVIAI is the explainability of the ensemble, in contrast to the other available AI-based pathogenicity prediction algorithms, which until now have been a black box for users. This explainability feature is of major importance considering the clinical responsibility of the medical decision-makers using AI-based pathogenicity predictors.

3

Spherical Phenotype Clustering

Nightingale, L.; Tuersley, J.; Cairoli, A.; Howes, J.; Shand, C.; Powell, A.; Green, D.; Strange, A.; Warchal, S.; Howell, M.

2024-12-05 bioinformatics 10.1101/2024.04.19.590313 medRxiv

Top 0.1%

11.9%

Phenotypic screening experiments produce many microscope images of cells under diverse perturbations, with biologically significant responses often subtle or difficult to identify visually. A central challenge is to extract image representations that distinguish activity from controls and group phenotypically similar perturbations. In this work we propose new adaptations of contrastive loss functions that incorporate experimental metadata as learned class vectors, and a geometrically inspired variant, called SPC, where class vectors are confined to the unit sphere and updated only by attractive terms (allowing more overlap of phenotypically similar classes). The approach is tested on two popular benchmarking datasets, BBBC021 and RxRx3-core; and we also evaluate performance on uncurated screens of HaCaT cells to gauge effectiveness in a realistic use-case scenario. We find we outperform prior methods across the three datasets and on a wide array of metrics measuring phenotype grouping, biological recall, drug-target interaction and mechanism-of-action inference. We also show we maintain this improved performance compared to models over 10x larger in parameter count, and that SPC can be used as an effective fine-tuning technique. The method is easy to implement and is well suited to settings with limited data or compute resources.

4

A Novel Machine Learning Approach Uncovers New and Distinctive Inhibitors for Cyclin-Dependent Kinase 9

Assmann, M.; Bal, M.; Craig, M.; D'Oyley, J.; Phillips, L.; Triendl, H.; Bates, P. A.; Bashir, U.; Ruprah, P.; Shaker, N.; Stojevic, V.

2020-03-19 bioinformatics 10.1101/2020.03.18.996538 medRxiv

Top 0.1%

8.9%

We present a novel combination of generative and predictive machine learning models for discovering unique protein inhibitors. The new method is assessed on its ability to generate unique inhibitors for the cancer associated protein kinase, CDK9. We validate our method by performing biochemical assays, attaining a hit rate of more than 10%, demonstrating the method to be a notable improvement upon a more standard, and somewhat naive approach. Moreover, we imposed the additional challenge of finding inhibitors that are readily synthesized. Importantly, two new inhibitors are found, with one being distinct from reported CDK9 inhibitors. We discuss the results in the context of modern machine learning principles and the desire expressed by the rational drug design community to secure molecules that are structurally different, yet with high binding affinities, to structurally determined protein targets.

5

Machine Learning for Predicting Therapeutic Outcomes in Acute Myeloid Leukemia Patients

Karathanasis, N.; Papasavva, P.; Oulas, A.; Spyrou, G. M.

2024-03-02 genetic and genomic medicine 10.1101/2024.02.29.24303536 medRxiv

Top 0.1%

7.3%

Background and Objective: The standard of care in Acute Myeloid Leukemia (AML) patients has remained essentially unchanged for nearly 40 years. Due to the complicated mutational patterns within and between individual patients and a lack of targeted agents for most mutational events, implementing individualized treatment for AML has proven difficult. We reanalysed the BeatAML dataset employing Machine Learning algorithms. The BeatAML project entails patients extensively characterized at the molecular and clinical levels and linked to drug sensitivity outputs. Our approach capitalizes on the molecular and clinical data provided by the BeatAML dataset to predict the ex vivo drug sensitivity for the 122 drugs evaluated by the project. Methods: We utilized ElasticNet, which produces fully interpretable models, in combination with a two-step training protocol that allowed us to narrow down computations. We automated the gene-filtering step by employing two metrics, and we evaluated all possible data combinations to identify the best training configuration settings per drug. Results: We report a Pearson correlation across all drugs of 0.36 when clinical and RNA sequencing data were combined, with the best-performing models reaching a Pearson correlation of 0.67. When we trained using the datasets in isolation, we noted that RNA sequencing data (Pearson: 0.36) attained three times the predictive power of whole exome sequencing data (Pearson: 0.11), with clinical data falling in between (Pearson: 0.26). Lastly, we present a paradigm of clinical significance. We used our model's prediction as a health management score to rank an individual's expected response to treatment. For 78 of 89 patients (88%), the proposed drug was more potent than the administered one, based on their ex vivo drug sensitivity data. Conclusions: Our reanalysis of the BeatAML dataset using Machine Learning algorithms demonstrates the potential for individualized treatment prediction in Acute Myeloid Leukemia patients, addressing the longstanding challenge of treatment personalization in this disease. By leveraging molecular and clinical data, our approach yields promising correlations between predicted drug sensitivity and actual responses, highlighting a significant step forward in improving therapeutic outcomes for AML patients.

Highlights:
- Machine learning can predict response to treatment in Acute Myeloid Leukemia patients.
- RNA sequencing data are more informative than whole exome sequencing and clinical data in predicting drug response in Acute Myeloid Leukemia patients.
- Drug response predictions could be used as a health management score to rank an individual's expected response to treatment.
- We identified a more potent drug than the administered one for 88% (78 out of 89) of the patients examined.

6

Improving Alphafold2 Performance With A Global Metagenomic & Biological Data Supply Chain

Munsamy, G.; Bohnuud, T.; Lorenz, P.

2024-03-06 genomics 10.1101/2024.03.06.583325 medRxiv

Top 0.1%

7.3%

Scaling laws suggest that more than a trillion species inhabit our planet, but only a minuscule and unrepresentative fraction (less than 0.00001%) have been studied or sequenced to date. Deep learning models, including those applied to tasks in the life sciences, depend on the quality and size of training or reference datasets. Given the large knowledge gap we experience when it comes to life on Earth, we present a data-centric approach to improving deep learning models in biology: we built partnerships with nature parks and biodiversity stakeholders across 5 continents covering 50% of global biomes, establishing a global metagenomics and biological data supply chain. With higher protein sequence diversity captured in this dataset compared to existing public data, we apply this data advantage to the protein folding problem by MSA supplementation during inference of AlphaFold2. Our model, BaseFold, exceeds traditional AlphaFold2 performance across targets from CASP15 and CAMEO, 60% of which show improved pLDDT scores, with RMSD values reduced by up to 80%. On top of this, the improved quality of the predicted structures can yield better docking results. By sharing benefits with the stakeholders this data originates from, we present a way of simultaneously improving deep learning models for biology and incentivising protection of our planet's biodiversity.

7

A deep-learning based analysis framework for ultra-high throughput screening time-series data

Balzerowski, P.; Gnutt, D.; Hebing, L.; Lima, F. d. A. e.; Manesso, E.; Mueller, T.; Diedam, H.

2024-08-22 bioinformatics 10.1101/2024.08.22.609110 medRxiv

Top 0.1%

6.8%

Analysis of ultra-high-throughput screening (uHTS) data sets is a critical step in drug discovery campaigns. Due to various environmental and experimental error sources, fast and reliable identification of possible candidate compounds is challenging. In this work, we introduce a novel deep-learning-based analysis framework to analyze uHTS time-series data sets. Our framework is based on two independent deep-learning models. A deep-learning regression model reduces temporal and spatial signal variation across multitier plates caused by systematic and random errors, and a separate variational autoencoder model is used for dimensionality reduction. In contrast to classical evaluation methods, our approach is capable of deriving lower-dimensional representations of time-series signals without a priori knowledge of the data-generating mechanism. We tested our analysis framework on an experimental uHTS data set and identified two distinct classes of substances in the screened library which could be attributed to two biological modes of action. Selected substances belonging to both modes of action were successfully validated in a secondary screening experiment.

8

deepBlastoid: A Deep Learning-Based High-Throughput Classifier for Human Blastoids Using Brightfield Images with Confidence Assessment

Fan, Z.; Li, Z.; Jin, Y.; Chandrasekaran, A. P.; Shakir, I. M.; Zhang, Y.; Siddique, A.; Wang, M.; zhou, X.; Tian, Y.; Wonka, P.; Li, M.

2024-12-09 developmental biology 10.1101/2024.12.05.627041 medRxiv

Top 0.1%

6.8%

Recent advances in human blastoids have opened new avenues for modeling early human development and implantation. Human blastoids can be generated in large numbers, making them suitable for high-throughput screening, which often involves analyzing vast numbers of images. However, automated methods for evaluating and characterizing blastoid morphology are still underdeveloped. We developed a deep-learning model capable of recognizing and classifying blastoid brightfield images into five distinct quality categories. The model processes 53.2 images per second with an average accuracy of 87%, without signs of overfitting or batch effects. By integrating a Confidence Rate (CR) metric, the accuracy was further improved to 97%, with low-CR images flagged for human review. In a comparison with human experts, the model matched their accuracy while significantly outperforming them in throughput. We demonstrate the value of the model in two real-world applications: (1) systematic assessment of the effect of lysophosphatidic acid (LPA) concentration on blastoid formation, and (2) evaluating the impact of dimethyl sulfoxide (DMSO) on blastoids for drug screening. In the applications involving over 10,000 images, the model identified significant effects of LPA and DMSO, which may have been overlooked in manual assessments. The deepBlastoid model is publicly available and researchers can train their own model according to their imaging conditions and blastoid culture protocol. deepBlastoid thus offers a precise, automated approach for blastoid classification, with significant potential for advancing mechanism research, drug screening, and clinical in vitro fertilization (IVF) applications.

9

EmbryoTempoFormer: clip-based developmental tempo inference from zebrafish brightfield time-lapse microscopy

Deng, L.; Lin, P.; Xie, L.

2026-03-11 developmental biology 10.64898/2026.03.09.710433 medRxiv

Top 0.1%

6.6%

Nominal hours post fertilization (hpf) are widely used to index zebrafish embryogenesis, yet under condition shifts--such as temperature change, genetic perturbation, or environmental stress--nominal time can decouple from true developmental progression. In such settings, biologically meaningful variation is better described as a systematic change in developmental tempo rather than a simple temporal offset. Here we introduce an embryo-resolved framework that treats developmental tempo as the primary quantity of interest in brightfield time-lapse imaging. We present EmbryoTempoFormer (ETF), a clip-based CNN-Transformer that predicts developmental progression from short time-lapse clips and is trained with a within-embryo temporal-difference consistency regularizer to promote temporally coherent trajectories. Crucially, we couple model predictions with an embryo-level inference and statistical workflow: temporally correlated clip-level outputs are aggregated into interpretable embryo-level tempo and stability readouts, and cross-condition effects are quantified using embryo-bootstrap confidence intervals with embryos--rather than frames or clips--as independent units, avoiding pseudo-replication. Using temperature perturbation as a representative domain shift, we robustly quantify condition-induced changes in global developmental dynamics and show that developmental delay predominantly manifests as reduced developmental tempo. This framework enables statistically principled, high-throughput phenotyping for perturbation screens, drug assays, and environmental stress studies.

Highlights:
- Clip-based CNN-Transformer predicts developmental time from brightfield time-lapse microscopy.
- Within-embryo temporal-difference consistency improves trajectory self-consistency.
- Embryo-level anchored tempo slopes enable interpretable cross-condition comparisons.
- Reproducible pipeline via code, scripts, and a Zenodo bundle with embryo-level inference.
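
The embryo-level bootstrap mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `embryo_bootstrap_ci`, the percentile method, and the toy tempo values are all assumptions made for illustration. The point it shows is that whole embryos, not frames or clips, are resampled, so each embryo contributes one independent value.

```python
import random
import statistics


def embryo_bootstrap_ci(tempo_a, tempo_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in mean tempo (b minus a).

    tempo_a, tempo_b: one tempo value per embryo, assumed to be already
    aggregated from that embryo's clip-level predictions.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample whole embryos with replacement within each condition.
        resample_a = rng.choices(tempo_a, k=len(tempo_a))
        resample_b = rng.choices(tempo_b, k=len(tempo_b))
        diffs.append(statistics.fmean(resample_b) - statistics.fmean(resample_a))
    diffs.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical per-embryo tempo values (arbitrary units), for illustration only.
control = [1.00, 0.97, 1.03, 0.99, 1.02, 0.98]
cold = [0.80, 0.84, 0.78, 0.83, 0.79, 0.82]

low, high = embryo_bootstrap_ci(control, cold)
print(f"95% CI for tempo difference (cold - control): [{low:.3f}, {high:.3f}]")
```

Resampling clips instead of embryos would treat correlated measurements from the same embryo as independent and would typically yield intervals that are too narrow, which is the pseudo-replication problem the abstract refers to.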

10

MorphoDiff: Cellular Morphology Painting with Diffusion Models

Navidi, Z.; Ma, J.; Miglietta, E. A.; Liu, L.; Carpenter, A. E.; Cimini, B. A.; Haibe-Kains, B.; Wang, B.

2024-12-20 bioinformatics 10.1101/2024.12.19.629451 medRxiv

Top 0.1%

6.2%

Understanding cellular responses to external stimuli is critical for parsing biological mechanisms and advancing therapeutic development. High-content image-based assays provide a cost-effective approach to examine cellular phenotypes induced by diverse interventions, which offers valuable insights into biological processes and cellular states. In this paper, we introduce MorphoDiff, a generative pipeline to predict high-resolution cell morphological responses under different conditions based on perturbation encoding. To the best of our knowledge, MorphoDiff is the first framework capable of producing guided, high-resolution predictions of cell morphology that generalize across both chemical and genetic interventions. The model integrates perturbation embeddings as guiding signals within a 2D latent diffusion model. The comprehensive computational, biological, and visual validations across three open-source Cell Painting datasets show that MorphoDiff can generate high-fidelity images and produce meaningful biology signals under various interventions. We envision the model will facilitate efficient in silico exploration of perturbational landscapes towards more effective drug discovery studies.

11

Clinical Advancement Forecasting

Czech, E. A.; Wojdyla, R. S.; Himmelstein, D. S.; Frank, D. H.; Miller, N. A.; Milwid, J. M.; Kolom, A.; Hammerbacher, J.

2024-08-03 genetic and genomic medicine 10.1101/2024.08.02.24311422 medRxiv

Top 0.1%

6.0%

Choosing which drug targets to pursue for a given disease is one of the most impactful decisions made in the global development of new medicines. This study examines the extent to which the outcomes of clinical trials can be predicted based on a small set of longitudinal (temporally labeled) evidence and properties of drug targets and diseases. We demonstrate a novel statistical learning framework for identifying the top 2% of target-disease pairs that are as much as 4-5x more likely to advance beyond phase 2 trials. This framework is 1.5-2x more effective than an Open Targets composite score based on the same set of evidence. It is also 2x more effective than a common measure for genetic support that has been observed previously, as well as in this study, to confer a 2x higher likelihood of success. Utilizing a subset of our biomedical evidence base, non-negative linear models resulting from this framework can produce simple weighting schemes across various types of human, animal, and cell model genomic, transcriptomic, proteomic, and clinical evidence to identify previously undeveloped target-disease pairs poised for clinical success. In this study we further explore: i) how longitudinal treatment of evidence relates to leakage and reverse causality in biomedical research and how temporalized evidence can mitigate common forms of potential biases and inflation; ii) the relative impact of different types of features on our predictions; and iii) an analysis of the space of currently undeveloped, tractable targets predicted with these methods to have the highest likelihood of clinical success. To ease reproduction and deployment, no data is used outside of Open Targets and the described methods require no expert knowledge, and can support expansion of lines of evidence to further improve performance.

12

Small molecule bioactivity benchmarks are often well-predicted by counting cells

Seal, S.; Dee, W.; Shah, A.; Zhang, A.; Titterton, K.; Cabrera, A. A.; Boiko, D.; Beatson, A.; Puigvert, J. C.; Singh, S.; Spjuth, O.; Bender, A.; Carpenter, A. E.

2025-04-30 bioinformatics 10.1101/2025.04.27.650853 medRxiv

Top 0.1%

5.6%

Phenotypic profiling methods, such as Cell Painting and gene expression, have been widely used to predict compound bioactivity, often showing improvement over predictive models based on chemical structures alone. We discovered that a large subset of assays in widely-used benchmark datasets either directly relate to cell health and cytotoxicity or are assays intending to capture a more specific phenotype but whose active compounds impact cell count, while inactives do not. As a result, counting cells can achieve similar predictive performance as Cell Painting or gene expression data. Filtering benchmarks to include only assays relating to protein targets reveals that Cell Painting can capture information that cannot be predicted by mere cell counting. We re-evaluated three benchmark datasets used with Cell Painting data and observed that, in many cases, cell count models produced an AUC comparable to models using the full Cell Painting profiles. However, in protein-target-specific benchmarks across 17 distinct protein targets, Cell Painting features demonstrated unique predictive power, outperforming mean balanced accuracy from cell count models with a relative improvement of 19.6%. We propose five practical recommendations for benchmarking machine learning models for predicting bioactivity, including using cell count as a baseline feature. Although multi-class classification applications (such as matching samples based on their morphological profile) are less likely to be predictable by cell count than bioactivity benchmarks, these recommendations are broadly applicable to machine learning for drug discovery.
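
The recommendation to use cell count as a baseline feature can be illustrated with a small sketch. It is not the authors' benchmark code: the helper `auc_from_scores`, the toy data, and the assumption that active compounds lower cell count are all hypothetical. It scores each compound by cell count alone and computes AUC-ROC through the pairwise-comparison (Mann-Whitney) identity.

```python
def auc_from_scores(scores, labels):
    """AUC-ROC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: cells counted per well and a binary assay label.
cell_counts = [120, 135, 410, 395, 150, 420, 380, 140]
active = [1, 1, 0, 0, 1, 0, 0, 1]

# Assumption: actives reduce cell count, so the negated count is the score.
baseline_auc = auc_from_scores([-c for c in cell_counts], active)
print(f"Cell-count baseline AUC: {baseline_auc:.2f}")
```

A model built on full morphological profiles would then need to beat this single-feature number on the same assay before its performance is attributed to richer phenotypic information.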

13

MOAST: Mechanism of Action Similarity Tool

Lohith, A.; Terciano, D.; Murray, A.; MacMillan, J.; Lokey, S.

2025-09-19 bioinformatics 10.1101/2025.09.15.676411 medRxiv

Top 0.1%

5.4%

Determining the mechanism of action (MOA) for natural products remains a significant bottleneck in drug discovery, particularly for researchers with limited computational resources or small compound libraries. Traditional approaches require screening large numbers of annotated compounds alongside unknowns, which is cost-prohibitive, or depend on complex machine learning models that need substantial computational resources and large datasets. Here, we present a dissertation chapter excerpt: MOAST (Mechanism of Action Similarity Tool), a BLAST-inspired computational workflow that addresses these limitations by providing rapid MOA hypotheses for newly screened compounds. This chapter investigates two complementary approaches: a kernel density estimation (KDE) method providing statistical significance measures and E-values for MOA class membership, and a CatBoost machine learning classifier for multi-class prediction with ranked outputs. Using cytological profiling data from HeLa and A549 cell lines, MOAST achieved 22% accuracy for the top 5 predictions among ~300 MOA classes, with the CatBoost classifier reaching 10% balanced accuracy--significantly better than the ~3% reported in literature. The tool suggests a 0.8 prediction probability threshold for trustworthy results and demonstrates robust performance across multiple feature reduction strategies. MOAST provides a practical, accessible solution that bridges traditional phenotypic screening and modern computational approaches, making MOA determination feasible for researchers with limited resources while maintaining statistical rigor and interpretability.

14

A RAG Chatbot for Precision Medicine of Multiple Myeloma

Quidwai, M. A.; Lagana, A.

2024-03-18 genetic and genomic medicine 10.1101/2024.03.14.24304293 medRxiv

Top 0.1%

5.4%

The advent of precision medicine has revolutionized cancer treatment by integrating individual genetic, lifestyle, and environmental factors to tailor patient care (Huang et al., 2020; Ginsburg and Phillips, 2018). However, the complexity and heterogeneity of diseases like Multiple Myeloma (MM) pose significant challenges in leveraging the vast amounts of genomic data and biomedical literature available for personalized treatment planning (Rajkumar, 2014; Rollig et al., 2015). To address this, we present an innovative Retrieval-Augmented Generation (RAG) based chatbot framework that harnesses the power of Natural Language Processing (NLP) and state-of-the-art language models to curate and analyze MM-specific literature and provide personalized treatment recommendations based on patient-specific genomic data (Lewis et al., 2020). Our framework integrates the BioMed-RoBERTa-base model for embedding generation (Gururangan et al., 2020) and the Mistral-7B language model for question answering (Anthropic, 2023), enabling effective understanding and response to complex clinical queries. The retrieval component is enhanced by Amazon OpenSearch Service, ensuring fast and accurate access to relevant information. A comprehensive data analysis pipeline, including exploratory data analysis, semantic search, clustering, and topic modeling, provides valuable insights into the MM research landscape, informing the chatbot's knowledge base and uncovering potential research directions (Blei et al., 2003; Mikolov et al., 2013). Deployed using Amazon Kendra, our RAG chatbot offers a user-friendly and scalable platform for accessing MM information, incorporating features such as user authentication, a customizable web interface, and continuous improvement based on user feedback. 
The framework aims to democratize access to precision medicine by providing clinicians with a sophisticated tool for interpreting complex genomic data in the context of MM, streamlining clinical workflows, and facilitating the development of personalized treatment plans (Patel et al., 2015). This paper presents the conceptualization, development, and potential impact of our RAG-based chatbot framework on the landscape of MM treatment and precision medicine. We argue that the synergistic integration of AI, NLP, and domain-specific knowledge marks a new era of healthcare, characterized by highly personalized, data-driven, and effective treatment modalities (Thong et al., 2021). Our framework not only advances the field of precision medicine in MM but also serves as a blueprint for the development of similar systems in other complex diseases, ultimately improving patient outcomes and quality of life.

15

In Silico Driven Prediction of MAPK14 Off-Targets Reveals Unrelated Proteins with High Accuracy

Kaiser, F.; Plach, M. G.; Leberecht, C.; Schubert, T.; Haupt, V. J.

2020-07-24 bioinformatics 10.1101/2020.07.24.219071 medRxiv

Top 0.1%

5.4%

During the discovery and development of new drugs, candidates with undesired and potentially harmful side-effects can arise at all stages, which poses significant scientific and economic risks. Most of such phenotypic side-effects can be attributed to binding of the drug candidate to unintended proteins, so-called off-targets. The early identification of potential off-targets is therefore of utmost importance to mitigate any downstream risks. We showcase how the combination of knowledge-based in silico off-target screening and state-of-the-art biophysics can be applied to rapidly identify off-targets for a MAPK14 inhibitor. Out of 13 predicted off-targets, six proteins were confirmed to interact with the inhibitor in vitro, which translates to an exceptional hit rate of 46%. For two proteins, affinities in the lower micromolar range were obtained: The kinase IRE1 and the Hematopoietic Prostaglandin D Synthase, which is entirely unrelated to MAPK14 and is involved in different cell-regulatory processes. The whole off-target identification/validation pipeline can be completed as fast as within two months, excluding delivery times of proteins. These results emphasize how computational off-target screening in combination with MicroScale Thermophoresis can effectively reduce downstream development risks in a very competitive time frame and at low cost.

16

A Multimodal Foundation Model for Discovering Genetic Associations with Brain Imaging Phenotypes

Machado Reyes, D.; Burch, M. C.; PARIDA, L.; Bose, A.

2024-11-04 genetic and genomic medicine 10.1101/2024.11.02.24316653 medRxiv

Top 0.1%

5.3%

Due to the intricate etiology of neurological disorders, finding interpretable associations between multi-omics features can be challenging using standard approaches. We propose COMICAL, a contrastive learning approach leveraging multi-omics data to generate associations between genetic markers and brain imaging-derived phenotypes. COMICAL jointly learns omic representations utilizing transformer-based encoders with custom tokenizers. Our modality-agnostic approach uniquely identifies many-to-many associations via self-supervised learning schemes and cross-modal attention encoders. COMICAL discovered several significant associations between genetic markers and imaging-derived phenotypes for a variety of neurological disorders in the UK Biobank as well as predicting across diseases and unseen clinical outcomes from the learned representations. Source code of COMICAL, along with pre-trained weights enabling transfer learning, is available at https://github.com/IBM/comical.

17

Comparative analysis of molecular representations in prediction of drug combination effects

Zagidullin, B.; Wang, Z.; Guan, Y.; Pitkänen, E.; Tang, J.

2021-06-04 bioinformatics 10.1101/2021.04.16.439299 medRxiv

Top 0.1%

5.2%

Application of machine and deep learning methods in drug discovery and cancer research has gained a considerable amount of attention in the past years. As the field grows, it becomes crucial to systematically evaluate the performance of novel computational solutions in relation to established techniques. To this end we compare rule-based and data-driven molecular representations in prediction of drug combination sensitivity and drug synergy scores using standardized results of 14 throughput screening studies, comprising 64 200 unique combinations of 4 153 molecules tested in 112 cancer cell lines. We evaluate the clustering performance of molecular representations and quantify their similarity by adapting the Centered Kernel Alignment metric. Our work demonstrates that to identify an optimal molecular representation type it is necessary to supplement quantitative benchmark results with qualitative considerations, such as model interpretability and robustness, which may vary between and throughout preclinical drug development projects.

18

Completion of the DrugMatrix Toxicogenomics Database using ToxCompl

Cong, G.; Patton, R. M.; Chao, F.; Svoboda, D. L.; Casey, W. M.; Schmitt, C. P.; Murphy, C.; Erickson, J. N.; Combs, P. A.; Auerbach, S. S.

2024-03-29 genomics 10.1101/2024.03.26.586669 medRxiv

Top 0.1%

4.9%

Show abstract

The DrugMatrix Database contains systematically generated toxicogenomics data from short-term in vivo studies for over 600 chemicals. However, most of the potential endpoints in the database are missing due to a lack of experimental measurements. We present our study on leveraging matrix factorization and machine learning methods to predict the missing values in the DrugMatrix, which includes gene expression across eight tissues on two expression platforms along with paired clinical chemistry, hematology, and histopathology measurements. One major challenge we encounter is the skewed distribution of the available measured data, in terms of both tissue sources and values. We propose a method, ToxiCompl, that applies systematic hybrid sampling guided by Bayesian optimization in conjunction with low-rank matrix factorization to recover the missing values. ToxiCompl achieves good training and validation performance from a machine learning perspective. We further conduct an in-depth validation of the predicted data from biological and toxicological perspectives with a series of analyses. These include examining the connectivity pattern of predicted gene expression responses, characterizing molecular pathway-level responses from sets of differentially expressed genes, evaluating known transcriptional biomarkers of tissue toxicity, and characterizing predicted apical endpoints. Our analysis shows that the predicted differential gene expression, broadly speaking, aligns with what would be anticipated. For example, in most instances, our predicted differentially expressed gene lists offer a connectivity level comparable to that of measured data in connectivity analysis. Using Havcr1, a known transcriptional biomarker of kidney injury, we identify treatments that, based on the predicted expression data, manifest kidney toxicity in a manner that is mechanistically plausible and supported by the literature.
Characterization of the predicted clinical chemistry data suggests that strong effects are relatively reliably predicted, while more subtle effects pose a greater challenge. In the case of histopathological prediction, we find a significant overprediction due to positivity bias in the measured data. Developing methods to deal with this bias is one of the areas we plan to target for future improvement. The main advantage of the ToxiCompl approach is that, in the absence of additional experimental data, it drastically extends the toxicogenomic landscape into a number of data-poor tissues, thereby allowing researchers to formulate mechanistic hypotheses about effects in tissues that have been underrepresented in the literature. All measured and predicted DrugMatrix data (i.e., gene expression, clinical chemistry, hematology, and histopathology) are available to the public through an intuitive GUI interface that allows for data retrieval, gene set analysis and high dimensional visualization of gene expression similarity (https://rstudio.niehs.nih.gov/complete_drugmatrix/).
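The low-rank matrix factorization idea mentioned in this abstract can be sketched in a few lines: fit two small factor matrices to the observed entries only, then read off predictions for the missing ones. This is a generic illustration under simplifying assumptions (a noiseless rank-1 toy matrix, plain stochastic gradient descent); it does not reproduce ToxiCompl's hybrid sampling or Bayesian optimization, and all names and values are hypothetical.

```python
import random

def factorize(observed, n_rows, n_cols, rank=1, lr=0.02, epochs=2000, seed=0):
    """Fit U (n_rows x rank) and V (n_cols x rank) to the observed entries via SGD."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            err = value - sum(u * v for u, v in zip(U[i], V[j]))
            for r in range(rank):
                u, v = U[i][r], V[j][r]
                # Gradient step on the squared error of this single entry.
                U[i][r] += lr * err * v
                V[j][r] += lr * err * u
    return U, V

def predict(U, V, i, j):
    """Predicted value for cell (i, j), observed or not."""
    return sum(u * v for u, v in zip(U[i], V[j]))

# Hypothetical toy matrix: rows are treatments, columns are endpoints,
# and each value is a product of a row effect and a column effect (rank 1).
row_effect = [0.5, 1.0, 1.5, 2.0]
col_effect = [1.0, 0.5, 2.0, 1.5]
held_out = (1, 2)  # pretend this cell was never measured
observed = {
    (i, j): row_effect[i] * col_effect[j]
    for i in range(4) for j in range(4) if (i, j) != held_out
}

U, V = factorize(observed, n_rows=4, n_cols=4)
recovered = predict(U, V, *held_out)  # true value is 1.0 * 2.0 = 2.0
```

Real toxicogenomic matrices are noisy, much sparser, and unevenly sampled, so the rank, regularization, and sampling strategy would all need tuning; the toy case only shows why a low-rank assumption lets unmeasured cells be inferred from measured ones.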

19

The (α, β)-k Boolean Signatures of Molecular Toxicity: Microcystin as a Case Study

Moscato, P.; Jaeger-Honz, S.; Haque, M. N.; Schreiber, F.

2024-12-29 bioinformatics 10.1101/2024.12.29.630644 medRxiv

Top 0.1%

4.8%

Show abstract

Background: The (α, β)-k-Feature Set Problem is a combinatorial problem that has been proven as an alternative to typical methods for reducing the dimensionality of large datasets without compromising the performance of machine learning classifiers. Results: We present a case study showing that solutions of the (α, β)-k-Feature Set Problem help to identify molecular substructures related to toxicity. The dataset investigated in this study is based on the inhibition of Ser/Thr protein phosphatases by Microcystin (MC) congeners. MC congeners are a class of structurally similar cyanobacterial toxins, which are critical to human consumption. Conclusion: We show that it is possible to identify biologically meaningful toxicity signatures by applying the (α, β)-k feature sets to extended connectivity fingerprint representations of MC congeners. Boolean rules were derived from the feature sets to classify toxicity and can be mapped onto the chemical structure, leading to insights into the absence or presence of substructures that can explain toxicity. The presented method can be applied to any other molecular data set and is therefore transferable to other use cases.
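To make the feature set idea concrete, the sketch below uses the usual formulation of the problem as an assumption: a set of binary features qualifies if every pair of samples from different classes differs in at least α of the chosen features, and every pair from the same class agrees in at least β of them. The brute-force search and the toy fingerprint bits are hypothetical and only practical for tiny inputs; the abstract does not say how the authors solve the problem.

```python
from itertools import combinations

def is_alpha_beta_feature_set(samples, labels, features, alpha, beta):
    """Check the (alpha, beta) conditions for the chosen feature indices."""
    for (x, label_x), (y, label_y) in combinations(zip(samples, labels), 2):
        differing = sum(x[f] != y[f] for f in features)
        agreeing = len(features) - differing
        if label_x != label_y and differing < alpha:
            return False  # different classes must be separated by >= alpha features
        if label_x == label_y and agreeing < beta:
            return False  # same class must agree on >= beta features
    return True

def smallest_feature_set(samples, labels, alpha, beta):
    """Exhaustively search for a valid feature set with the smallest k."""
    n_features = len(samples[0])
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            if is_alpha_beta_feature_set(samples, labels, subset, alpha, beta):
                return subset
    return None

# Hypothetical fingerprint bits for four congeners; label 1 = toxic, 0 = non-toxic.
fingerprints = [
    (1, 0, 1, 0),
    (1, 1, 1, 0),
    (0, 0, 1, 1),
    (0, 1, 0, 1),
]
toxicity = [1, 1, 0, 0]

single = smallest_feature_set(fingerprints, toxicity, alpha=1, beta=1)
robust = smallest_feature_set(fingerprints, toxicity, alpha=2, beta=1)
```

In this toy data the α = 2 solution selects bits 0 and 3, which corresponds to the Boolean rule "bit 0 present and bit 3 absent implies toxic"; raising α trades a larger k for redundancy in how the classes are separated.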

20

CuNA: Cumulant-based Network Analysis of genotype-phenotype associations in Parkinson's Disease

Bose, A.; Platt, D. E.; Haiminen, N.; Parida, L.

2021-08-05 genetic and genomic medicine 10.1101/2021.08.02.21261457 medRxiv

Top 0.1%

4.8%

Show abstract

Parkinson's Disease (PD) is a progressive neurodegenerative movement disorder characterized by loss of striatal dopaminergic neurons. Progression of PD is usually captured by a host of clinical features represented in different rating scales. PD diagnosis is associated with a broad spectrum of non-motor symptoms, such as depression and sleep disorder, as well as motor symptoms such as movement impairment. The variability within the clinical phenotype of PD makes detection of the genes associated with early-onset PD a difficult task. To address this issue, we developed CuNA, a cumulant-based network analysis algorithm that creates a network from higher-order relationships between eQTLs and phenotypes as captured by cumulants. We also designed a multi-omics simulator, CuNAsim, to test CuNA's qualitative accuracy. CuNA accurately detects communities of clinical phenotypes and finds genes associated with them. When applied to PD data, we find previously unreported genes INPP5J, SAMD1 and OR4K13 associated with symptoms of PD affecting the kidney, muscles and olfaction. CuNA provides a framework to integrate and analyze RNA-seq, genotype and clinical phenotype data from complex diseases for more targeted diagnostic and therapeutic solutions in personalized medicine. CuNA and CuNAsim binaries are available upon request.
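A small example may help show what "higher-order relationships as captured by cumulants" can mean. The sketch below computes the third-order joint cumulant of three variables, which equals the mean of the product of their centered values, and compares it with pairwise covariance. It is only an illustration of the statistic: CuNA's actual cumulant computation and network construction are not described in the abstract, and the variables are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Second-order joint cumulant (population covariance)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def third_order_cumulant(x, y, z):
    """Third-order joint cumulant: mean of the product of centered values."""
    mx, my, mz = mean(x), mean(y), mean(z)
    return sum((a - mx) * (b - my) * (c - mz) for a, b, c in zip(x, y, z)) / len(x)

# Hypothetical coded variables: z depends on x and y jointly (z = x * y),
# for instance a phenotype driven by the interaction of two other features.
x = [-1.0, -1.0, 1.0, 1.0]
y = [-1.0, 1.0, -1.0, 1.0]
z = [a * b for a, b in zip(x, y)]

pairwise = [covariance(x, y), covariance(x, z), covariance(y, z)]
triple = third_order_cumulant(x, y, z)
```

Here every pairwise covariance is zero while the third-order cumulant is 1, so the three-way dependence would be invisible to a network built from pairwise correlations alone.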